ED 205 552



TM 810 431



AUTHOR          Poggio, John P.; And Others
TITLE           An Empirical Investigation of the Angoff, Ebel and
                Nedelsky Standard Setting Methods.
INSTITUTION     Kansas Univ., Lawrence.
SPONS AGENCY    Kansas State Dept. of Education, Topeka.
PUB DATE        Apr 81
NOTE            29p.; Paper presented at the Annual Meeting of the
                American Educational Research Association (65th, Los
                Angeles, CA, April 13-17, 1981).

EDRS PRICE      MF01/PC02 Plus Postage.
DESCRIPTORS     *Comparative Analysis; *Criterion Referenced Tests;
                *Cutting Scores; Elementary Secondary Education;
                Evaluation Criteria; *Minimum Competency Testing;
                Quantitative Tests; Reading Tests; *Scoring Formulas;
                Test Validity
IDENTIFIERS     Angoff Methods; Contrasting Groups Method; Ebel
                Method; Kansas; Nedelsky Method



ABSTRACT 

A comparison of four frequently used standard setting
methods for deriving cut-off scores with respect to the expected
performance of minimally competent students is presented in this
paper. Ten Kansas Competency Based Tests, in reading and mathematics,
were administered across five grade levels in a state-wide minimal
competency testing program. School districts were randomly assigned
the Angoff, Ebel or Nedelsky standard setting method and 50 districts
were assigned the Contrasting Groups Method. Descriptions of the
types of judgments required and the procedures used for deriving a
standard are given for each method. Methods of analysis are
documented and are followed by the results and their discussion.
Statistical evidence is provided in the eight tables appended. Table
8 provides a framework within which to view the pattern of results
for the ten tests. Data indicate that performance score standards are
consistently ranked, and that discrepancies were substantial between
methods but internal consistency was high. It is concluded that,
since standard setting methods differ and competency level is a
continuous variable, methods are bound to produce different results.
The superiority of a single method is supported neither by existing
literature nor by data. (Author/AEP)



Reproductions supplied by EDRS are the best that can be made
from the original document.






Presented at the Annual Meeting of the American Educational 
Research Association, Los Angeles, California, April, 1981. 

An Empirical Investigation of the Angoff, Ebel and
Nedelsky Standard Setting Methods¹

John P. Poggio, Douglas R. Glasnapp and Dawn S. Eros

University of Kansas

Since the concept of criterion-referenced testing was introduced during
the 1960's (Glaser, 1963; Popham & Husek, 1969), much attention has focused
on criterion-referenced measures. Acceptance and application of the concept
has shown tremendous growth. An example of the widespread adoption of
criterion-referenced measures is reflected in the minimum competency testing
movement that began in the early 1970's. Since that time, 38 states
have taken some sort of formal action involving competency-based assessment.
The debate on the merits and demerits of minimum competency testing
continues. Regardless of the length and breadth of these debates, the reality
is that minimum competency testing is occurring. An issue which arises within
a competency testing program is that of determining the performance standards
or passing scores.

A variety of alternative procedures for setting standards have received
much attention in the literature. Extensive descriptions of the properties
of the methods, their underlying assumptions, the purposes each addresses
and general procedures to follow when setting standards are numerous (Millman,
1973; Meskauskas, 1976; Jaeger, 1976; Glass, 1978; Hambleton, 1978; Hambleton,
Powell & Eignor, 1979; Shepard, 1980). However, in the opinion of many of
these authors, there is little empirical evidence to offer users the necessary
guidance in choosing among these methods for setting standards. To date,
only a few investigations have been conducted comparing the performance of
different methods. The results of these studies tend to be mixed. For
example, Andrew and Hecht (1975) compared the procedures proposed by
Nedelsky (1954) and Ebel (1972), as did Skakun and Kling (1980). The latter,
however, used a modification of the Ebel procedure which prevents
generalizability of the results to non-similar situations. The Nedelsky
procedure was also used in studies by Brennan and Lockwood (1979) and
Koffler (1980), comparing it to a procedure proposed by Angoff (1971) and
a procedure known as the Contrasting Groups method, respectively. Across
studies, there is a lack of consistency in the type of design employed, which
standard setting methods were used, and whether the focus was solely on the
actual standards derived or whether an examination of the methods'
psychometric properties was included. Hence, there is still little conclusive
comparative evidence to assist users in choosing among alternative standard
setting methods.

1. The research reported in this paper was supported by a contract from
the Kansas State Department of Education.

The purpose of the present investigation was to simultaneously compare
four frequently used standard setting methods: Angoff, Ebel, Nedelsky and
Contrasting Groups. Within the context of a state-wide minimal competency
testing program, comparisons of the four derived cut-off scores were possible
for ten different tests (two content areas, reading and mathematics,
across five grade levels, 2, 4, 6, 8 and 11). However, comparisons were not
limited to the description and pattern of discrepancies among standards
across methods. Rather, evaluation of the procedures also included extensive
examination of the psychometric properties of each, including reliabilities
and validities of both judges' ratings and judgments about students using each
standard.


METHOD 

Context of Data Collection 

During the 1979-80 school year, all 2nd, 4th, 6th, 8th and 11th grade
students in the State of Kansas were required to take the Kansas
Competency-Based Tests in reading and mathematics. The number of competencies
assessed in each content area was 15, 20, 20, 20 and 19 for the five grade
levels, respectively. Three test items were used to assess each competency,
resulting in test lengths of 45 items at Grade 2, 60 items at Grades 4, 6 and
8 and 57 items at Grade 11 for each content area assessed. All test items were
in a multiple-choice format. Each item was presented with four alternatives.

As part of this testing program, performance standards for judging minimal
competency were required for each grade level and test area. Because no one
method available appeared to be superior, it was decided to collect standard
setting data using a variety of procedures. A synthesis of these data was
used to set the performance standards for the State.

Data Collection Procedures

Approximately 60 percent (198) of the State's school districts volunteered
to participate in the standard setting activities. Each participating district
was randomly assigned one of three methods (Angoff, Ebel or Nedelsky) to use
in setting standards. In addition to using a specific procedure, each district
was requested to provide standard setting responses for six of the ten tests
available. The set of six tests assigned to a district was chosen from one of
ten patterns to assure an approximately equal number of ratings for each test
using each procedure.

Within each district, the test coordinator was directed to select an
experienced educator at each grade level and content area for every one of the
six different tests assigned to the district. It was suggested that the
individuals who were to participate in setting standards be knowledgeable and
have at least two years teaching experience in the content area at a specific
test grade level. Six packets of materials, each containing a copy of the test
to be rated and a set of instructions for the standard setting procedure
to be used, were sent to the test coordinator for distribution to the
educators selected. The raters reviewed and judged the tests, based upon the
set of instructions detailing the method they were to use, just prior to the
actual administration of the tests to the students. The responses of each
rater were returned to the district test coordinator, who forwarded them to
the investigators.

In all, usable standard setting data were obtained from 926 teachers.
Examination of the demographic characteristics of the group indicated that
teachers selected by their districts to participate were well qualified for
the task in terms of years teaching experience (mean = 13), years with the
district (mean = 8.3), professional training (68 percent with work beyond the
bachelor's degree) and type of responsibility (97 percent currently teaching
the content at the grade level on which they set standards). Table 2
identifies the number of respondents who provided data for a given grade level
by area test using a specific procedure. The group sizes ranged from a low
of 24 (Nedelsky - Reading - Grade 6) to a high of 41 (Ebel - Math - Grade 11).

In addition to the judgmental ratings of test items using the Angoff,
Ebel or Nedelsky procedures, data also were collected appropriate for setting
standards using the Contrasting Groups Method. The test coordinators from a
representative sample of 50 districts were requested to randomly select one
elementary and one junior high building in their district. At these
buildings, all second, fourth, sixth and eighth grade students were to be rated
by their teacher on overall competence in reading and mathematics given the
State competencies. Packets of materials containing specific directions to the
building principals and teachers and rating forms accompanied the testing
materials. The rating directions to the teachers indicated that they should
only rate a student in math or reading if they were responsible for the
student's instruction in that area. A list of State content area competencies
was included with each rating form and the teacher was asked to review these
objectives prior to making the individual ratings. All teacher ratings were
recorded in a special codes section on the student's test answer sheet. Table 1
identifies the number of students rated by grade level and content area. The
numbers of students rated by teachers as minimally competent and not minimally
competent also are provided. In total, 12,575 ratings were received (6278 in
reading, 6297 in math).

Insert Table 1 

Standard Setting Methods 

As noted by Jaeger (1979), all standard setting methods involve judgmental
decision making at some level and differ only by the ". . . proximity of
judgment-determining data to the original performance" (p. 48). As such, he
suggests classification of methods under either a proximal (direct) or distal
(derived) model, referenced as judgmental or empirical models, respectively.
Jaeger also includes the judgmental-empirical combination methods within
the proximal model.

Within the context of the present study, use of methods included in the
proximal model classification was appropriate. The Angoff, Ebel and Nedelsky
procedures are based on expert judges' assessments of individual item
content with respect to the expected performance of minimally competent
students. Although the specific manner in which items are assessed differs
across the methods, all are purported to yield an overall score which would or
should be attained by minimally competent students, thereby providing a
standard for judging the competence of an individual. For these methods, both
the judgments made and the standards derived are independent of the actual
performance of students on the test. Using these procedures, passing scores
may be set prior to the test administration.

The fourth method used, Contrasting Groups, also involves expert judgment.
However, the focus is on making judgments about actual individual
test takers rather than on test item content. Judges' classifications of
students as either competent or non-competent serve to define two "known
groups." As with the item inspection methods, the judgments made are
independent of actual test performance. However, the final standards are
dependent on performance, being derived to "maximize" correct classification
of students into the groups to which they are judged to belong. For each
method, a brief description of the type of judgment(s) required and the
procedure used for deriving a standard follows.

Angoff method. For each item, our educators were asked to estimate the
probability (on a scale of 0 to 100) that a minimally competent student would
correctly answer the item. In essence, the judges were estimating the
difficulty level of an item referencing a hypothetical group of individuals who
would be judged as minimally competent. To obtain the overall standard,
probabilities assigned by a judge were converted to proportions and summed; the
average of these sums, across judges, provided the final passing score. In
effect, this standard represents the (estimated) mean total score for a group
of minimally competent individuals. As such, any student scoring below the
mean of this reference group would not be judged to be competent.
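
The Angoff computation just described can be illustrated with a minimal Python sketch. The function name and the ratings are hypothetical and are not data from this study; the sketch only shows the conversion of 0 to 100 ratings to proportions, the summing within judge, and the averaging across judges.

```python
from statistics import mean


def angoff_standard(ratings):
    """Return the Angoff passing score.

    ratings: one list per judge, holding that judge's 0-100 estimate
    for every item (hypothetical input layout).
    """
    # Convert each rating to a proportion and sum within judge.
    judge_sums = [sum(p / 100.0 for p in judge) for judge in ratings]
    # The passing score is the average of the judges' sums.
    return mean(judge_sums)


# Hypothetical ratings from three judges on a four-item test.
example_ratings = [
    [80, 60, 90, 70],
    [70, 50, 80, 60],
    [90, 70, 100, 80],
]
angoff_cut = angoff_standard(example_ratings)
print(round(angoff_cut, 2))
```

With these made-up ratings the three judges' sums are 3.0, 2.6 and 3.4 items, so the sketch yields a passing score of about 3 of the 4 items.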




Ebel method. For this method, judges made three separate types of
judgments. First, judges rated each item on two separate dimensions: level of
difficulty (easy, medium or hard) and level of relevance (essential,
important, acceptable or questionable). For each judge, then, all items could
be classified into one of 12 cells in a 3 x 4 grid defined by the three
difficulty and four relevance categories. Judges then indicated the percentage
of items within each of the 12 cells that a student should answer correctly
in order to be judged minimally competent. To derive the standard from
these data each item was assigned to one of the twelve cells based on the
teachers' ratings. The percent passing judgment for a cell was then
multiplied by the number of items in the cell and these products were summed
over all 12 cells to get an overall passing score for a judge. These passing
scores were then averaged over judges to get the composite passing
score. Unlike the Angoff and Nedelsky procedures, interpretation of the
meaning of this standard is not as precise. While difficulty of an item
is reflected, it is not necessarily the major determinant. Differential
weighting due to item relevance and whether judges indicate that, for
example, more "hard" items of some sort "should" be answered correctly
than "easy" items of another sort will affect the overall level of the
standard.
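
As an illustration only, the Ebel derivation can be sketched in Python as follows. The data layout, function names and example judgments are assumptions made for the sketch, not materials from the study: each judge supplies a (difficulty, relevance) cell for every item plus a percent-correct judgment for each cell used.

```python
from collections import Counter
from statistics import mean


def ebel_judge_score(item_cells, cell_percent):
    """Passing score for one judge.

    item_cells: list of (difficulty, relevance) tuples, one per item.
    cell_percent: dict mapping a cell to the percent of its items a
        minimally competent student should answer correctly.
    """
    items_per_cell = Counter(item_cells)
    # Percent judgment times number of items in the cell, summed over cells.
    return sum(cell_percent[cell] / 100.0 * n
               for cell, n in items_per_cell.items())


def ebel_standard(judges):
    """Composite passing score: average of the judges' passing scores."""
    return mean(ebel_judge_score(cells, pct) for cells, pct in judges)


# Two hypothetical judges rating a four-item test.
judge_a = (
    [("easy", "essential"), ("easy", "essential"),
     ("hard", "important"), ("medium", "essential")],
    {("easy", "essential"): 90, ("hard", "important"): 50,
     ("medium", "essential"): 70},
)
judge_b = (
    [("easy", "essential"), ("medium", "important"),
     ("hard", "important"), ("medium", "important")],
    {("easy", "essential"): 100, ("medium", "important"): 60,
     ("hard", "important"): 40},
)
ebel_cut = ebel_standard([judge_a, judge_b])
print(round(ebel_cut, 2))
```

Note that the two hypothetical judges place the same items in different cells, which is why the item-to-cell assignment is kept separately for each judge in this sketch.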

Nedelsky method. For each item, judges were asked to indicate which,
if any, distractors the minimally competent student should be able to
eliminate as incorrect. The score for an item then became the reciprocal
of the number of alternatives not eliminated. For an individual student
this score represents the probability of correctly answering the item, the
"chance score." Although obtained through a different process, these scores
carry the same type of interpretation as those from Angoff. For a
(hypothetical) group of minimally competent students, the score represents the
item's difficulty level. To obtain the overall standard, the item scores are
summed for each judge and these sums are then averaged across judges. This
standard also represents the mean total test score expected from the reference
group.

The Nedelsky method also includes a component which allows the user to
determine what percentage of the group of minimally competent students should
fall above and below the standard. The assumption is made that the variability
in individual judges' standards equals that of the total scores of the
reference group. Using the standard deviation of the judges' scores as an
estimate of that of the reference group, adding or subtracting a constant
number of standard deviation units (k) to the original passing score decreases
or increases, respectively, the number of minimally competent students judged
as competent. There does not appear to be a consistent recommendation in the
literature for the value to be assigned to k. For the purpose of this study, k
was assigned a value of 1.
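
A short Python sketch may help make the Nedelsky arithmetic concrete. It is a hedged illustration: the function names and judgments are hypothetical, four alternatives per item are assumed as on the Kansas tests, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not say which form was used.

```python
from statistics import mean, stdev


def nedelsky_judge_score(eliminated, n_alternatives=4):
    """Sum of item scores for one judge.

    eliminated: number of distractors the judge says a minimally
        competent student can rule out, one entry per item.
    """
    for e in eliminated:
        if e not in range(n_alternatives):
            raise ValueError("cannot eliminate every alternative")
    # Item score is the reciprocal of the alternatives left standing.
    return sum(1.0 / (n_alternatives - e) for e in eliminated)


def nedelsky_standard(judges, k=1.0, n_alternatives=4):
    """Mean of the judges' sums plus k standard deviations of those sums."""
    sums = [nedelsky_judge_score(j, n_alternatives) for j in judges]
    return mean(sums) + k * stdev(sums)


# Three hypothetical judges rating a four-item test.
example_judges = [
    [3, 2, 1, 0],
    [2, 2, 1, 0],
    [3, 3, 1, 0],
]
unadjusted_cut = nedelsky_standard(example_judges, k=0)
adjusted_cut = nedelsky_standard(example_judges, k=1)
print(round(unadjusted_cut, 3), round(adjusted_cut, 3))
```

Setting k to 0 returns the unadjusted mean standard, and a positive k raises the passing score, so fewer minimally competent students are judged as competent.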

Contrasting Groups method. For a sample of students who will have scores
on the test, each is classified by a judge into one of two groups, Competent
or Not Competent, relative to the content being assessed. Based upon this
group membership classification and the actual test scores of these students,
a standard is derived using statistical likelihood ratio procedures which
minimize the probability of misclassification of students into groups.

There are several variants in the specific statistical procedures
available. Choice of the appropriate variant is dependent upon the population
distribution shapes and relative variances of the two groups' test scores. When
both groups are normally distributed and have equal variances, the Linear
Discriminant Function (LDF) derived by Fisher (1936) is the appropriate
statistic; with normality but unequal variances the appropriate likelihood
ratio statistic is the Quadratic Discriminant Function (QDF). Finally, with
non-normal distributions, non-parametric analogs to the LDF and QDF (for equal
and unequal variances, respectively) are appropriate. In the present study both
the normality and equal variance assumptions were violated, making the
non-parametric QDF procedures appropriate. Throughout the present
investigation, the methodology detailed by Koffler (1980) was followed, setting
the "costs" of false positives and false negatives equal in all situations.
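
The non-parametric discriminant procedure itself is not reproduced here. As a simplified stand-in that conveys the same goal, the following Python sketch scans every possible cut score and keeps the one with the fewest total misclassifications, weighting false positives and false negatives equally. The function name, the example scores, and the convention that a score at or above the cut counts as passing are all assumptions made for illustration.

```python
def contrasting_groups_cut(competent_scores, not_competent_scores, max_score):
    """Return (cut, errors): the cut score with fewest misclassifications.

    A student passes when score >= cut (assumed convention). Ties are
    resolved in favor of the lowest cut score.
    """
    best_cut, best_errors = 0, None
    for cut in range(0, max_score + 2):
        # Judged competent but scoring below the cut.
        false_negatives = sum(1 for s in competent_scores if cut > s)
        # Judged not competent but scoring at or above the cut.
        false_positives = sum(1 for s in not_competent_scores if s >= cut)
        errors = false_negatives + false_positives  # equal costs
        if best_errors is None or best_errors > errors:
            best_cut, best_errors = cut, errors
    return best_cut, best_errors


# Hypothetical scores on a ten-item test.
judged_competent = [8, 9, 7, 10, 6, 9]
judged_not_competent = [3, 5, 4, 7]
cut_and_errors = contrasting_groups_cut(judged_competent,
                                        judged_not_competent, 10)
print(cut_and_errors)
```

One property of this simplified rule is that when very few students are judged not competent and their scores overlap heavily with the competent group, the error-minimizing cut can fall at zero.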

Methods of Analysis

Several different indices were available to serve as the units of
analysis. The Angoff, Ebel and Nedelsky methods provided individual item
ratings on one or more dimensions for each judge, and a total test composite
standard was derived for each judge. In addition to the data provided by the
judges, item difficulties and total test score distributions were known based
on actual student performance data.

All analyses performed were descriptive in nature. The intent of the
analyses was to provide comparative descriptive information resulting from
application of each of the methods. The statistical information provided
includes:

1. Distributional characteristics of the judges' composite scores
within a group.

2. Indices of reliabilities of the judges' ratings.

3. Correlations among the various indices resulting from the standard
setting methods using the item responses as the units of analysis.

4. Squared-error loss reliability (Brennan, 1980) estimates (indices
of dependability) for total test scores given a passing score
determined by a standard setting procedure.

5. Classification agreement tables on student designation of competent
based on student performance data using the passing score determined
by each method as one classification criterion and teacher judgments
as the other criterion.

All results are provided for ten replications across the five grade levels and
two content areas.

RESULTS 

Table 2 provides information on the distributional characteristics of the
Angoff, Ebel and Nedelsky methods. These indices were calculated using the
total test composite ratings as the unit of analysis. Several points should be
noted from these results. The distributions of the ratings for all methods
have a consistent negative skew, although in most cases it is only slight. As
expected in negatively skewed distributions, the means tend to be slightly
smaller than the medians. The difference in these two measures of central
tendency is negligible in most instances, particularly for the Ebel and
Nedelsky procedures. Greater discrepancies exist in the Angoff procedure
where ratings tend to be more negatively skewed than in the Ebel or Nedelsky
methods. The last distributional descriptive index of importance is the
variability of the judges' ratings. As indicated, the Angoff procedure
results in ratings which are considerably more variable than for either of the
other two procedures.

The most important comparative information in Table 2 is the resulting
performance standard suggested by each procedure for the same test. Using
the means as the performance standard, the Ebel procedure consistently
results in the highest score. The Angoff procedure identifies a passing score
in the same region of the distribution, but is generally one to five score
points lower. Only in the case of the second grade mathematics test was the
Angoff procedure considerably higher than the Ebel method. The suggested
passing scores from both of these procedures were substantially higher than
those resulting from the Nedelsky procedure, even if the value of k is set
at 3 or 4.



Insert Table 2

As a measure of the consistency of ratings across items and judges, the
reliability of ratings on several dimensions was computed using analysis of
variance methods resulting in alpha coefficients of internal consistency.
For the Angoff procedure, only one rating was available for use in the
calculation of a reliability index. The Ebel procedure provided four ratings:
item difficulty, item relevance, the assigned cell percentage and the item
composite based on the other three ratings. Two scores result from the
Nedelsky procedure: the specific alternative(s) selected and the number of
alternatives selected per item. Reliability coefficients were calculated
for each of these scores: the one Angoff item rating, the four Ebel measures
and the two Nedelsky scores. These reliability coefficients for each
procedure are given in Table 3 for each grade level and content area tested.
All of the coefficients are high, with .89 the lowest coefficient. The
individual Ebel ratings tend to be lowest (lower to mid .90's), but the
judgment reliabilities for the item composite scores are high for all three
methods (.98 and .99). The consistency of the judges in rating the items
appears to be exceptionally stable for all three procedures.
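
For readers unfamiliar with the index, a coefficient alpha for a set of judges' ratings can be sketched in Python as below. This uses the usual variance form of alpha, which gives the same value as the analysis of variance (Hoyt) approach; the layout of the ratings matrix (items as rows, judges as columns), the function name and the numbers are assumptions for illustration, not the study's data or its exact computation.

```python
from statistics import variance


def coefficient_alpha(ratings):
    """Alpha for an items-by-judges matrix of ratings.

    ratings: list of rows, one row per item, one column per judge.
    Judges play the role of the "parts" whose consistency is assessed.
    """
    n_judges = len(ratings[0])
    judge_columns = list(zip(*ratings))
    # Sum of the variances of each judge's ratings over items.
    sum_judge_variances = sum(variance(col) for col in judge_columns)
    # Variance over items of the ratings totaled across judges.
    total_variance = variance([sum(row) for row in ratings])
    return (n_judges / (n_judges - 1)) * (
        1 - sum_judge_variances / total_variance)


# Hypothetical Angoff-style ratings: four items rated by three judges
# who differ in leniency but order the items identically.
example_matrix = [
    [80, 70, 90],
    [60, 50, 70],
    [90, 80, 100],
    [70, 60, 80],
]
alpha = coefficient_alpha(example_matrix)
print(round(alpha, 3))
```

Because the hypothetical judges rank the items identically, alpha reaches its ceiling of 1.0 even though the judges differ in leniency. High internal consistency therefore says nothing about whether the judges agree on the level of the standard.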



Insert Table 3

Information in Table 4 presents bivariate correlations between actual
item difficulties on the Kansas Competency Tests and the item probabilities
assigned by judges to the items in the standard setting process using the
Angoff, Ebel and Nedelsky methods. In reviewing these data recall that in
applying the standard setting procedures the referent group is "minimally
competent students." Item difficulties were computed across all students
tested at a grade level (N ≈ 32,000). A number of patterns emerge in these
data. Overall the Angoff method shows the highest levels of association
with the actual item difficulties. The Ebel difficulty component closely
parallels that of the Angoff appraisals. However, the Ebel difficulty
ratings (easy, medium, hard), when blended with the remaining facets of
the method, produce a composite evaluation that is not at all consistent
with the observed item difficulties. Item difficulties when correlated
with the Nedelsky item probabilities tend to be lower and less variable
over replications than the other methods.



Insert Table 4 

To further explore these data, item probabilities were then correlated
over methods. Results from this analysis are given in Table 5. Most
apparent in these data are the rather high concurrent coefficients between
the Ebel difficulty item evaluations and the ratings of items by the
Nedelsky and Angoff methods. The pattern appears to be consistent with
the exception of very low correlations at Grade 8. Correlations between
the composite item probabilities tend not to be as high. However, the
correlations between Angoff and Ebel item composite ratings are considerably
higher than between Nedelsky and either of the other two. The pattern
of correlations suggests the presence of the difficulty component affecting
ratings across all three methods, yet beyond this aspect each method appears
to capitalize on aspects of specific variance unique to itself.

Insert Table 5 




The next stage of analyses addressed the issue of characteristics
resulting from classification given the test standard/criterion. Table 6
reports indices of dependability given select criterion scores on each of the
Kansas tests. The method used to produce these indices has been identified
as a squared-error loss approach (Berk, 1980; Kane and Brennan, 1980). This
method has two characteristics worth noting. First, misclassification
(master/non-master) further from the cut-score is treated more seriously than
misclassification based on scores close to the cut-score. Second, the
distribution of coefficients of dependability is "u shaped" and coefficients
are lowest as the cut-score approaches the raw score mean. Coefficients are
computed based on a single test administration. Values reported in Table 6
are not corrected for chance placements.
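
The "u shaped" behavior can be illustrated with Livingston's K-squared coefficient, a classical squared-error loss agreement index. It is used here only as a stand-in: the indices in Table 6 may have been computed with a different squared-error formula, and the test mean, variance and reliability below are hypothetical values.

```python
def livingston_k2(mean_score, score_variance, reliability, cut):
    """Livingston's K-squared for a given cut score.

    K2 = (reliability * variance + (mean - cut)^2)
         / (variance + (mean - cut)^2)
    """
    squared_distance = (mean_score - cut) ** 2
    return ((reliability * score_variance + squared_distance)
            / (score_variance + squared_distance))


# Hypothetical 60-item test: mean 45, variance 64, reliability .90.
for cut in (29, 37, 45, 53):
    print(cut, round(livingston_k2(45, 64, 0.90, cut), 3))

k2_at_mean = livingston_k2(45, 64, 0.90, 45)
k2_below_mean = livingston_k2(45, 64, 0.90, 37)
```

In this sketch the coefficient equals the ordinary reliability when the cut score sits at the mean and rises symmetrically as the cut score moves away from it in either direction, which is the pattern the text describes.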



Insert Table 6 

The values underscored in Table 6 are the agreement coefficients associated
with the cut-score produced by each method investigated for each test. The
minimum passing score derived from the Nedelsky procedure was treated as the
mean rating assigned over judges plus one standard deviation unit of the
judges' ratings (mean + 1 SD). Criterion scores for the remaining methods are
based on the mean rating over judges. A consideration of these data shows the
Nedelsky minimum passing score over tests resulting in the highest agreement
coefficients for the methods considered. The lowest coefficients result with
the Ebel method. These results are due to the fact that the standard from the
Ebel procedure consistently approximates the actual test raw score mean.
Overall, however, the observed indices of dependability appear high. Although
consistent discrepancies are noted, the magnitude of these discrepancies is
mediated by the specific technique used and by not correcting for chance
placements.






Table 7 summarizes the results of classification based on teacher
judgment and score classification derived using the Nedelsky, Ebel, Angoff and
Contrasting Groups approaches. Raw score frequencies are reported. The
derived minimum passing score (standard) found for each procedure is also
given. For this analysis teacher classification at each grade level and
in the content area tested served as the criterion variable. The Contrasting
Groups standards were derived from these teacher classifications (see
Methods section). As such the fit of this procedure to the actual
classification of obtained test scores might be expected to be quite stable for
the sample of students for whom these data were available.

From the data presented a number of findings emerge. It is evident
that the rank ordering of the procedures studied finds the Nedelsky method
resulting consistently in the lowest raw score standard, followed by the
Contrasting Groups standard, then the Angoff standard, while the highest
standard is yielded by the Ebel method. The pattern of standards being
computed for tests is one of substantial variability. That is, for the
most part the standards suggested by the methods for a given test tend to be
quite disparate. For a given test, the order of the performance standards
resulting from the four procedures is consistent. The Nedelsky procedure
always results in the lowest standard followed by the Contrasting Groups,
Angoff and Ebel in that order. The four procedures also result in
performance standards which tend to have different degrees of variability in
location across the ten different tests. The performance standard identified
by the Contrasting Groups ranged from 50 percent (Math-Grade 8) to 70
percent (Reading and Math-Grade 4) of the items correct. This range ignores
Math-Grade 2 where the resulting standard was set at a score of zero. The
Angoff procedure resulted in performance standards from 65 percent to 88
percent of the items correct. The range for the Ebel method was 72
percent to 84 percent of the items correct and for Nedelsky 47 percent to
50 percent of the items correct.

Across grade levels the percentages of students judged as not competent
by teachers in grades 2, 4, 6 and 8 were 3 percent, 7 percent, 7 percent
and 4 percent, respectively, in reading and 1 percent, 7 percent, 8
percent and 9 percent, respectively, in mathematics. Unfortunately, there
was considerable variability and overlap in the actual test performance
scores for students judged as competent and not competent by the teachers.
The passing score derived from the Contrasting Groups procedure minimizes
the number of misclassification errors given the student performance data.
Using the teacher judgments as the criterion for correct classification,
the frequency of a specific type of classification error for the other
three procedures is dependent on how far their standard is below or above
the Contrasting Groups standard. The Nedelsky approach results in a
greater percent of false masters, while the Angoff and Ebel standards
reduce this form of error while increasing the occurrence of false
non-masters. In considering these data, recall that the Nedelsky approach
requires defining a value, k, which is then applied to the mean rating
over judges. In practice this value would be set based on discussion
and negotiation. For the present study this value was taken as 1 based
on suggestions from Nedelsky's writing on the topic (1954). Had this
value been different, either greater or smaller, then the classification
data in Table 7 would change.
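
The kind of classification agreement table discussed here can be sketched in Python as follows. The function name, the scores and the teacher judgments are hypothetical, and the convention that a score at or above the cut passes is an assumption; the sketch only shows how a lower or higher standard shifts errors between false masters and false non-masters.

```python
def classification_table(scores, teacher_competent, cut):
    """Cross-tabulate pass/fail at a cut score with teacher judgments."""
    table = {"true_master": 0, "false_master": 0,
             "false_non_master": 0, "true_non_master": 0}
    for score, competent in zip(scores, teacher_competent):
        passed = score >= cut  # assumed passing rule
        if passed and competent:
            table["true_master"] += 1
        elif passed and not competent:
            table["false_master"] += 1
        elif competent:
            table["false_non_master"] += 1
        else:
            table["true_non_master"] += 1
    return table


# Hypothetical scores on a ten-item test with teacher judgments.
scores = [8, 9, 7, 10, 6, 9, 3, 5, 4, 7]
teacher_competent = [True] * 6 + [False] * 4

low_standard = classification_table(scores, teacher_competent, cut=6)
high_standard = classification_table(scores, teacher_competent, cut=9)
print(low_standard)
print(high_standard)
```

With these made-up data the lower cut produces one false master and no false non-masters, while the higher cut removes the false master at the price of three false non-masters.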




DISCUSSION 

Table 8 provides a framework within which to view the pattern of results
that emerge from this investigation. Presented are select descriptive
statistics associated with each of the 10 tests that formed the basis of this
investigation (Poggio and Glasnapp, 1980). A review of these data suggests
that the tests, based on pupil performance, provide a variety of replications
over which to consider the generalizability of the findings of this
investigation of standard setting methods.

The performance score standards resulting from the application of the four
procedures are consistently ranked in the same order with Ebel producing the
highest standard, then Angoff, then Contrasting Groups and Nedelsky with the
lowest standard. The result that none of the procedures consistently produces
the same standard supports the findings from previous studies (see, for
example, Andrew & Hecht, 1976; Koffler, 1980; Skakun & Kling, 1980). The added
significance of the present finding is that no previous study had compared all
four procedures within a single context. In addition to producing different
performance standards, the discrepancies in most instances were substantial.

The internal consistency of the ratings within a given procedure was extremely
stable. Coefficients of .98 and .99 were obtained for all procedures. The
validity information, however, did show differences across the procedures. Using
item difficulties based on student performance as an external criterion, the
correlations with item ratings from each procedure produced moderate
coefficients in the .40 to .70 range. The intercorrelations among the Angoff,
Ebel and Nedelsky item indices indicated moderate to high coefficients between
Angoff and Ebel ratings, with low coefficients between Nedelsky and Angoff and
between Nedelsky and Ebel. Each procedure would appear to be using perceived
item difficulty as a basis for judgments, but the specific directions for a
procedure tend to alter these perceptions and create variability in item
ratings unique to a procedure.
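
The validity coefficients described above are correlations, computed over
items, between the judges' mean item ratings and the observed item
difficulties. A minimal sketch of that computation follows. It assumes a
Pearson product-moment correlation (the type of coefficient is not stated in
this section), and the function name and the five-item toy data are
hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy data only: mean judged probability of success per item, and the
# observed proportion of students answering each item correctly.
mean_item_rating = [0.90, 0.80, 0.60, 0.70, 0.50]
item_difficulty = [0.95, 0.85, 0.70, 0.60, 0.65]

r = pearson_r(mean_item_rating, item_difficulty)
print(round(r, 2))
```

The same function applied to two procedures' item ratings, rather than to
ratings and item difficulties, gives an intercorrelation of the kind reported
between procedures.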

For any pair of methods, the Ebel and Angoff procedures appear to be most 
similar in both the pattern of ratings across the same items and in the result- 
ant performance score standard which is produced. Of these two, the Angoff 
composite ratings exhibited substantially more variability across raters than 
did the Ebel procedure. The consequence of this variability is that the 
standard error of the Angoff performance standard would be considerably 
greater than that of the Ebel procedure. Across groups of similar judges, 
the Ebel procedure would produce the more stable performance standard. 
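
The stability argument can be made concrete under a simple assumption: if the
performance standard is the mean of the judges' composite ratings and the
judges are treated as a random sample, its standard error is S divided by the
square root of the number of judges. The sketch below applies that textbook
formula to the Reading 4 entries of Table 2 (Angoff: S = 11.2, N = 32; Ebel:
S = 3.9, N = 25). The function name is hypothetical, and the formula is an
assumption rather than a computation reported by the study.

```python
import math


def standard_error_of_standard(sd_across_judges, n_judges):
    """Standard error of a mean rating, assuming independent judges."""
    return sd_across_judges / math.sqrt(n_judges)


# Reading 4 values as read from Table 2.
angoff_se = standard_error_of_standard(11.2, 32)
ebel_se = standard_error_of_standard(3.9, 25)

print(round(angoff_se, 2), round(ebel_se, 2))  # 1.98 0.78
```

Under these assumptions the Angoff standard for Reading 4 carries roughly two
and a half times the sampling error of the Ebel standard, in raw score points.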

It is difficult to interpret the error classification rates in Table 7.
Comparative interpretation is relative to the validity assumed for the
independent teacher judgments of student competency. The evidence that exists
suggests that teacher classification is not very highly related to actual
student test performance. Correlations between these two indices for Grades 2,
4, 6 and 8 are .51, .62, .60 and .52 in reading, respectively, and .41, .57,
.65 and .67 in mathematics, respectively. Given this relationship, it is
difficult to justify a decision which uses teacher classification as the
ultimate criterion.

The ideal validity study on standard setting procedures would result in
the same decisions across all procedures used. However, the data in the
present study accentuate the fact that the procedures studied are different
rather than similar and produce different results. The conclusion that
one is superior to another cannot be drawn without a consensus
external criterion. Rather, the information provided on each procedure should
offer a basis for a more valid selection of a method if standard setting is
desired.




The lack of similarity among procedures supports the position that level 
of competency needs to be viewed as a continuous variable. If a testing pro- 
gram is oriented toward the purpose of providing information and feedback 
about a group rather than certification, it would seem desirable to provide
performance information for a variety of cut-points, e.g., 90 percent, 80
percent, 70 percent, etc. of the items correct. Those individuals respons- 
ible for making policy decisions based on the data may then impose their own 
internal standards when recommending decisions. The use of a single method 
to set a performance standard is arbitrary and the existing literature 
and present data would not support the superiority of any one of the four 
methods investigated. 
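
Reporting performance at several cut-points, as suggested above, requires
only a tally per cut-point. The sketch below is a hypothetical illustration,
not part of the Kansas reporting system: the function name, the choice to
round each percentage cut up to the next whole raw score, and the toy score
list are all assumptions.

```python
def report_cut_points(scores, n_items, percents=(90, 80, 70)):
    """For each percent-correct cut-point, return (raw cut score,
    percent of students scoring at or above that raw score)."""
    report = {}
    for pct in percents:
        # Integer ceiling of n_items * pct / 100 (assumed rounding rule).
        cut = -(-n_items * pct // 100)
        n_pass = sum(1 for s in scores if s >= cut)
        report[pct] = (cut, 100.0 * n_pass / len(scores))
    return report


# Toy raw scores on a 60-item test; not data from the study.
scores = [60, 55, 50, 47, 42, 30]
report = report_cut_points(scores, n_items=60)
for pct, (cut, pct_passing) in report.items():
    print(pct, cut, round(pct_passing, 1))
```

A table of this kind leaves the choice of standard to the policy maker rather
than fixing it through any one of the four methods.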






REFERENCES 



ANDREW, B.J. & HECHT, J.T. A preliminary investigation of two procedures for
examination standards. Educational and Psychological Measurement, 1976,
36, 45-50.

ANGOFF, W.H. Scales, norms and equivalent scores. In R.L. Thorndike (Ed.),
Educational Measurement. Washington, D.C.: American Council on Education,
1971.

BERK, R.A. A consumers' guide to criterion-referenced test reliability.
Journal of Educational Measurement, 1980, 17, 323-349.

BRENNAN, R.L. Applications of generalizability theory. In R.A. Berk (Ed.),
Criterion-referenced measurement: The state of the art. Baltimore, MD:
The Johns Hopkins University Press, 1980.

BRENNAN, R.L. & KANE, M.T. An index of dependability for mastery tests.
Journal of Educational Measurement, 1977, 14, 277-289.

BRENNAN, R.L. & LOCKWOOD, R.E. A comparison of two cutting score procedures
using generalizability theory. Paper presented at the Annual Meeting of
the National Council on Measurement in Education, San Francisco, 1979.

EBEL, R.L. Essentials of educational measurement. Englewood Cliffs, N.J.:
Prentice-Hall, 197?.

FISHER, R.A. The use of multiple measurements in taxonomic problems. Annals
of Eugenics, 1936, 7, 179-188.

GLASER, R. Instructional technology and the measurement of learning outcomes.
American Psychologist, 1963, 18, 519-521.

GLASS, G.V. Standards and criteria. Journal of Educational Measurement, 1978,
15, 237-261.

HAMBLETON, R.K. On the use of cut-off scores with criterion-referenced tests in
instructional settings. Journal of Educational Measurement, 1978, 15,
277-289.

HAMBLETON, R.K., POWELL, S., & EIGNOR, D.R. Issues and methods for
standard-setting. Amherst, MA: School of Education, University of
Massachusetts, 1979.

JAEGER, R.M. Measurement consequences of selected standard-setting models.
Florida Journal of Educational Research, 1976, 18, 22-27.

KOFFLER, S.L. A comparison of approaches for setting proficiency standards.
Journal of Educational Measurement, 1980, 17, 167-178.




MESKAUSKAS, J.A. Evaluation models for criterion-referenced testing: Views
regarding mastery and standard-setting. Review of Educational Research,
1976, 46, 133-158.

MILLMAN, J. Passing scores and test lengths for domain-referenced measures.
Review of Educational Research, 1973, 43, 205-216.

NEDELSKY, L. Absolute grading standards for objective tests. Educational
and Psychological Measurement, 1954, 14, 3-19.

POGGIO, J.P. & GLASNAPP, D.R. Report of research findings: The Kansas
Competency Testing Program, 1980. Topeka, KS: Kansas State Department of
Education, 1980.

POPHAM, W.J. & HUSEK, T.R. Implications of criterion-referenced measurement.
Journal of Educational Measurement, 1969, 6, 1-9.

SKAKUN, E.N. & KLING, S. Comparability of methods for setting standards. Journal
of Educational Measurement, 1980, 17, 229-235.

SHEPARD, L.A. Technical issues in minimum competency testing. In D.C. Berliner
(Ed.), Review of Research in Education (Vol. 8). Itasca, Ill.: F.E. Peacock,
1980.






Table 1

Number of Students Rated by Teachers
on Competency in Reading and Mathematics
by Grade Level

Teacher Ratings/                         Grades
Classification                  2       4       6       8

Reading
  Minimally Competent        1290    1335    1353    1982
  Not Minimally Competent      38      93     101      86
  TOTAL                      1328    1428    1454    2068

Mathematics
  Minimally Competent        1299    1340    1307    1923
  Not Minimally Competent      18     103     117     190
  TOTAL                      1317    1443    1424    2113




Table 2

Distributional Characteristics of the Judges' Ratings
For the Angoff, Ebel and Nedelsky Procedures

Angoff (N=312)

Test     N      X̄     Mdn     Q1      Q3      S      SK      Ku
R2      37    36.4   37.4   33.9    39.9    4.5    -.9     -.2
R4      32    42.4   45.1   34.4    52.0   11.2    -.5    -1.3
R6      28    43.3   42.8   38.6    57.5    7.8    -.8      .3
R8      35    42.3   44.3   35.6    50.6    9.8    -.7      .1
R11     30    41.5   43.0   37.1    47.0    7.8   -1.1     1.2
M2      32    39.1   39.8   36.8    41.8    3.8   -1.3     2.3
M4      26    45.6   48.6   43.2    51.2    8.7   -1.3      .6
M6      33    42.9   43.6   37.8    57.5    7.5    -.5     -.1
M8      28    38.2   39.3   32.3    45.5   10.1    -.3     -.7
M11     31    36.9   39.3   30.0    44.8   10.3    -.6     -.6

Ebel (N=337)

Test     N      X̄     Mdn     Q1      Q3      S      SK      Ku
R2      36    37.3   37.7   36.8    38.9    ?.4    -.7      .1
R4      25    42.2   42.0   39.6    44.8    3.9     .2      .1
R6      31    46.3   46.8   44.1    48.2    3.1    -.2     -.7
R8      38    47.3   47.4   45.1    49.5    2.7    -.01    -.8
R11     37    45.0   45.1   43.4    47.1    3.2    -.3     -.8
M2      37    37.4   37.2   36.2    41.3    2.4   -1.0     1.0
M4      33    46.3   46.2   44.4    48.3    3.1    -.3      .1
M6      31    46.8   45.6   44.4    49.3    3.3     .6    -1.1
M8      29    44.9   45.0   43.6    46.7    3.2    -.7      .9
M11     41    43.4   43.1   41.5    44.9    2.4     .6      .8

Nedelsky (N=277)

Test     N      X̄     Mdn     Q1      Q3      S      SK      Ku
R2      32    19.1   19.4   16.8    21.3    2.7    -.5    -1.0
R4      23    25.2   24.8   22.8    28.1    3.3    -.2     -.7
R6      24    24.2   23.8   21.3    27.7    3.7    ?.5    -1.0
R8      30    25.0   25.2   23.2    27.0    2.4     .1    -1.0
R11     29    23.3   23.3   21.5    24.8    2.1    -.3     -.7
M2      32    18.0   18.0   16.2    20.3    2.9    -.1    -1.0
M4      25    25.1   25.3   22.5    27.5    3.6    -.2    -1.1
M6      25    25.5   27.0   25.5    27.7    2.6    -.8     -.1
M8      28    25.0   25.0   23.2    27.0    2.9    -.2     -.6
M11     29    20.5   21.1   17.7    22.9    2.9    -.3    -1.4

? = character illegible in the source copy.





Table 3

Internal Consistency Reliability Coefficients
For Judges' Responses Within the Angoff, Ebel and
Nedelsky Procedures

                      Nedelsky                           Ebel
Test   Angoff   Alternative  Composite   Difficulty  Relevance  Cell%  Composite
R2      .98        .97          .97         .96         .95      .89      .98
R4      .99        .93          .97         .94         .97      .?0      .98
R6      .99        .98          .99         .93         .94      .93      .98
R8      .99        .57          .98         .90         .92      .93       ?
R11     .97        .98          .98         .94         .99      .94      .99
M2      .96        .96          .98         .93         .91      .91      .98
M4      .98        .98          .99         .90         .93      .91      .97
M6      .98        .96          .98         .93         .99      .95      .99
M8      .99        .98          .99         .96         .95      .91      .97
M11     .99        .99          .99         .92         .95      .95      .98

? = value or character illegible in the source copy.






Table 4

Correlations Among Standard Setting Method
Item Ratings and Test Item Difficulties

                                        METHOD
                                                        Ebel
Test (# items)     Angoff   Nedelsky   Difficulty  Relevance  Composite
Reading 2 (45)      .66       .52         .63         .25        .47
Reading 4 (60)      .66       .56         .73         .25        .71
Reading 6 (60)      .49       .47         .51         .56        .28
Reading 8 (60)      .38       .52         .15          ?         .40
Reading 11 (57)     .46       .31         .45         .19        .42
Math 2 (45)         .71       .50         .59         .43        .55
Math 4 (60)         .74       .52         .72         .37        .56
Math 6 (60)         .81       .48         .73         .56        .65
Math 8 (60)         .41      -.28         .19         .24        .62
Math 11 (57)        .54       .46         .48         .29        .55

? = value illegible in the source copy.






Table 5

Correlations Among Angoff, Ebel and
Nedelsky Item Ratings

                        Angoff with                             Nedelsky with
Test   Nedelsky  Ebel(Diff)  Ebel(Rel)  Ebel(Comp)   Ebel(Diff)  Ebel(Rel)  Ebel(Comp)
R 2      .718      .937        .727       .865          .818       .565       .802
R 4      .778      .924        .473        ?             ?          ?          ?
R 6      .545      .925        .635        ?             ?          ?          ?
R 8      .542      .136        .647        ?             ?          ?          ?
R11      .559      .886        .398       .734           ?          ?          ?
M 2      .483      .942        .742       .902          .437       .312       .350
M 4      .626      .897        .701       .827          .740       .252       .503
M 6      .510      .888        .687       .866         -.401       .202       .359
M 8      .123      .914        .821       .561          .061       .021       .237
M11      .454      .850        .176       .740          .477       .035       .509

? = value illegible in the source copy.






Table 6

Indices of Dependability for Minimum
Passing Scores Suggested by Each Procedure

Criterion                 Reading                                 Mathematics
Score        2       4       6       8      11         2       4       6       8      11
  18       .992    .992    .989    .993    .995      .997C   .993    .992    .990    .984
  19       .991    .991    .988    .992    .994       ?      .993    .991    .990    .983
  20       .990    .991    .987    .992    .994      .996    .993    .990    .989    .982
  21       .989    .990    .986    .991    .994      .996N   .992    .990    .988    .980
  22       .988N   .989    .985    .990    .993      .995    .992    .989    .987    .978
  23       .987    .988    .984    .990    .993      .995    .991    .988    .986    .977
  24        ?      .988    .983    .989    .992      .994    .990    .987    .985    .975N
  25       .984    .987    .982    .988    .992      .994    .990    .986    .984    .972
  26       .982    .986    .980    .987     ?        .993    .989    .985    .983    .970
  27       .979C   .985    .978    .986    .990      .992    .988    .984    .982    .967
  28       .976    .983    .97?    .985N   .989      .991    .987    .983    .980N   .964
  29       .973    .982N   .974    .984    .988      .990    .986N   .981    .978    .960
  30       .969    .980    .972    .982    .987      .989    .985    .980N   .977C   .956
  31       .963    .979    .968    .981    .986      .987    .984    .978    .979    .952
  32       .957    .977    .966    .979    .984      .984    .982    .976    .972    .947
  33       .949    .974    .962    .976    .983      .981    .980    .974    .970    .942
  34       .940    .972    .958    .974    .981      .977    .979    .971    .967    .937
  35       .929    .969    .953    .971    .979      .972    .976    .968    .964    .931
  36       .917    .965    .948C   .968    .976      .965    .974    .965    .960    .926
  37       .905A   .962    .942    .964    .973      .956    .971    .961    .956    .921
  38        ?      .957    .935    .959    .969      .943E   .968    .957C   .952    .917
  39       .887    .952    .928    .954C   .964       ?      .965    .952    .948A   .914
  40       .886    .947    .926    .948    .958       ?      .961    .948    .944    .912A
  41       .891    .941    .913    .940    .951       ?      .957C   .942    .940     ?
  42       .901     ?      .903    .932    .942A      ?      .952    .937    .936    .914
  43       .913    .929A   .898    .923A    ?         ?      .947    .932A   .933    .920E
  44       .926    .922    .892A   .912    .918       ?      .942    .927    .930     ?
  45       .937    .916    .889    .901    .902E      ?      .937    .922    .928E   .928
  46               .911    .888    .890     ?                .932A   .919     ?      .930
  47               .908    .889E   .880    .861              .928E   .917E   .928    .936
  48               .907    .893     ?      .840              .925    .917    .930    .941
  49               .907    .899    .867    .823              .924    .918    .933    .946
  50               .910    .906    .867    .815              .923    .921    .936    .951

A = Angoff Standard
C = Contrasting Groups Standard
E = Ebel Standard
N = Nedelsky Standard
? = value or character illegible in the source copy.






Table 7

Classification Agreement Based on Performance Standards
for Each Procedure Using Teacher Judgments as the Criteria

                                                METHOD
                                          Contrasting
Teacher                   Nedelsky          Groups            Angoff             Ebel
Classification           N-M      M       N-M      M       N-M      M       N-M      M

Reading 2:
  Non-Masters              4     34        14     24        29      9        29      9
  Masters                  5   1285        15   1275       184   1106       320   1060
  Standard                    22               27               37               38

Reading 4:
  Non-Masters             32     61        68     24        74     19        74     19
  Masters                 27   1308       134   1201       152   1183       152   1183
  Standard                    29               42               43               43

Reading 6:
  Non-Masters             34     67        66     35        84     17        93      8
  Masters                 32   1321       134   1219       352   1001       497    856
  Standard                    28               36               44               47

Reading 8:
  Non-Masters              8     78        25     61        34     52        48     38
  Masters                 10   1972        75   1907       149   1833       408   1574
  Standard                    28               39               43               48

Mathematics 2:
  Non-Masters              0     18         0     18        10      8         8     10
  Masters                  1   1298         0   1299        94   1205        42   1257
  Standard                    21                0               40               33

Mathematics 4:
  Non-Masters              0    103        69     34        83     20        87     16
  Masters                 24   1316       149   1191       245   1095       267   1073
  Standard                    29               42               46               47

Mathematics 6:
  Non-Masters             51     66        92     25       102     15       109      8
  Masters                 46   1261       146   1161       267   1040       426    881
  Standard                    30               38               43               47

Mathematics 8:
  Non-Masters             57    133        70    120       122     68       137     53
  Masters                 50   1875        64   1859       300   1623       563   1360
  Standard                    28               30               39               45

N-M = classified non-master by the method's standard; M = classified master.






Table 8

Descriptive Statistics Found for
the Kansas Competency Tests

Area          Grade   Items     X̄      Mdn.      S      P̄        N
Reading         2      45     39.6    41.7     5.9    .88    31,579
Reading         4      60     48.2    50.9     9.4    .80    33,589
Reading         6      60     45.9    48.2     9.2    .77    31,060
Reading         8      60     49.5    51.6     7.7    .83    32,067
Reading        11      57     50.1    51.5     5.6    .88    30,881

Mathematics     2      45     42.6    43.5     3.6    .95    31,284
Mathematics     4      60     49.5    52.9     9.7    .83    33,576
Mathematics     6      60     47.6    50.3    10.0    .80    31,037
Mathematics     8      60     45.9    48.7    11.1    .77    31,999
Mathematics    11      57     40.6    42.3    10.6    .71    30,752






